{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yUStykVPstO-"
   },
   "source": [
    "Authors: Andreas Haupt, Jannis Kück, Alexander Quispe and Anzony Quispe, Vasilis Syrgkanis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UYchb-zYstPD"
   },
   "source": [
    "# Testing the Convergence Hypothesis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wUf3nUm6stPE",
    "outputId": "763db6dc-5d67-48c3-fca8-af6ca15e1c61"
   },
   "outputs": [],
   "source": [
    "!pip install multiprocess\n",
    "!pip install pyreadr\n",
    "!git clone https://github.com/maxhuppertz/hdmpy.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "_T9cPLQs3iJt",
    "outputId": "6c04814a-adaa-4046-fe91-793c8cc33248"
   },
   "outputs": [],
   "source": [
    "!pip install wget # for data loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bYy0blnFstPG"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import wget\n",
    "import hdmpy\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import warnings\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HMXjmGpDstPG"
   },
   "outputs": [],
   "source": [
    "from pyreadr import read_r\n",
    "\n",
    "sys.path.insert(1, \"./hdmpy\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lJ4tQ_uestPH"
   },
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9S5-dpYIstPH"
   },
   "source": [
    "We provide an additional empirical example of partialling-out with Lasso to estimate the regression coefficient $\\beta_1$ in the high-dimensional linear regression model:\n",
    "  $$\n",
    "  Y = \\beta_1 D +  \\beta_2'W + \\epsilon.\n",
    "  $$\n",
    "  \n",
    "Specifically, we are interested in how the rates  at which economies of different countries grow ($Y$) are related to the initial wealth levels in each country ($D$) controlling for country's institutional, educational, and other similar characteristics ($W$).\n",
    "  \n",
    "The relationship is captured by $\\beta_1$, the *speed of convergence/divergence*, which measures the speed at which poor countries catch up $(\\beta_1< 0)$ or fall behind $(\\beta_1> 0)$ rich countries, after controlling for $W$. Our inference question here is: do poor countries grow faster than rich countries, controlling for educational and other characteristics? In other words, is the speed of convergence negative: $ \\beta_1 <0?$ This is the Convergence Hypothesis predicted by the Solow Growth Model. This is a structural economic model. Under some strong assumptions, that we won't state here, the predictive exercise we are doing here can be given causal interpretation.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7BUqjAIBstPI"
   },
   "source": [
    "The outcome $Y$ is the realized annual growth rate of a country's wealth  (Gross Domestic Product per capita). The target regressor ($D$) is the initial level of the country's wealth. The target parameter $\\beta_1$ is the speed of convergence, which measures the speed at which poor countries catch up with rich countries. The controls ($W$) include measures of education levels, quality of institutions, trade openness, and political stability in the country."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jOnK7_gdstPJ"
   },
   "source": [
    "## Data analysis\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cVnFmY0YstPK"
   },
   "source": [
    "We consider the data set GrowthData which is included in the package *hdm*. First, let us load the data set to get familiar with the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bgEZMfsgstPK"
   },
   "outputs": [],
   "source": [
    "# I downloaded the data that the author used\n",
    "fname = \"https://github.com/CausalAIBook/MetricsMLNotebooks/blob/main/data/GrowthData.rda?raw=true\"\n",
    "growth_read = read_r(wget.download(fname))\n",
    "\n",
    "# Extracting the data frame from rdata_read\n",
    "growth = growth_read['GrowthData']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "daYwWCzAstPK"
   },
   "source": [
    "We determine the dimension of our data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kKjo4g2tstPL",
    "papermill": {
     "duration": 0.025013,
     "end_time": "2021-01-20T08:46:45.109042",
     "exception": false,
     "start_time": "2021-01-20T08:46:45.084029",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The sample contains $90$ countries and $63$ controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nyzB9b7-stPM",
    "outputId": "bd620493-fe2b-4651-dd44-ee76a6566f7c"
   },
   "outputs": [],
   "source": [
    "growth.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VyZvlMA1stPM",
    "papermill": {
     "duration": 0.024124,
     "end_time": "2021-01-20T08:46:45.157510",
     "exception": false,
     "start_time": "2021-01-20T08:46:45.133386",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Thus $p \\approx 60$, $n=90$ and $p/n$ is not small. We expect the least squares method to provide a poor estimate of $\\beta_1$.  We expect the method based on partialling-out with Lasso to provide a high quality estimate of $\\beta_1$.\n",
    "To check this hypothesis, we analyze the relation between the output variable $Y$ and the other country's characteristics by running a linear regression in the first step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wjrs2JvvstPM"
   },
   "outputs": [],
   "source": [
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5MdYGz9FstPM"
   },
   "outputs": [],
   "source": [
    "y = growth['Outcome']\n",
    "X = growth.drop('Outcome', axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "onzm4E74stPN"
   },
   "outputs": [],
   "source": [
    "reg_ols = sm.OLS(y, X).fit(cov_type='HC1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DGZ6SjkostPN"
   },
   "outputs": [],
   "source": [
    "# output: estimated regression coefficient corresponding to the target regressor\n",
    "est_ols = reg_ols.params['gdpsh465']\n",
    "# output: HC1 std. error\n",
    "std_ols = reg_ols.HC1_se['gdpsh465']\n",
    "# output: 95% confidence interval\n",
    "lower_ci, upper_ci = reg_ols.conf_int(alpha=0.05).loc['gdpsh465'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tfCnh-HRstPN"
   },
   "source": [
    "## Summarize OLS results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 89
    },
    "id": "a4nXHBrrstPO",
    "outputId": "7d717e5b-8562-4638-915f-10b0b995b202"
   },
   "outputs": [],
   "source": [
    "table = pd.DataFrame(columns=[\"Estimator\", \"Std. Error\", \"lower bound CI\", \"upper bound CI\"])\n",
    "table.loc['OLS'] = [est_ols, std_ols, lower_ci, upper_ci]\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZxPC6gCFstPO"
   },
   "source": [
    "Least squares provides a rather noisy estimate (high standard error) of the\n",
    "speed of convergence, and does not allow us to answer the question\n",
    "about the convergence hypothesis since the confidence interval includes zero.\n",
    "\n",
    "In contrast, we can use the partialling-out approach based on lasso regression (\"Double Lasso\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AVAL-WUxstPO"
   },
   "outputs": [],
   "source": [
    "y = growth['Outcome'].values\n",
    "W = growth.drop(['Outcome', 'intercept', 'gdpsh465'], axis=1)\n",
    "D = growth['gdpsh465'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1XFIq9k0stPP"
   },
   "source": [
    "## Method 1 - Lasso with Theoretical Setting of Penalty using HDMPY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bhkKQO8kstPP"
   },
   "source": [
    "While cross validation is commonly employed for choosing penalty parameters in Lasso, it can be very noisy and tends to choose relatively small penalty leading to some overfitting. For this reason, one should not use cross validation to choose tuning parameters unless sample splitting is employed. We illustrate the use of sample combined with cross validation in later chapters in the book. Since we are using the full sample here, it is much better (and theoretically valid) to use penalties that provably control overfitting, which is what we do here.\n",
    "\n",
    "We report the results using cross validation at the end of this notebook for comparison. There, we observe overfitting for the prediction of the outcome."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LUhRELnwstPP"
   },
   "outputs": [],
   "source": [
    "res_y = hdmpy.rlasso(W, y, post=False).est['residuals']\n",
    "res_D = hdmpy.rlasso(W, D, post=False).est['residuals']\n",
    "\n",
    "r_y = pd.DataFrame(res_y, columns=['r_y'])\n",
    "r_D = pd.DataFrame(res_D, columns=['r_D'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bZQPwxAfstPP"
   },
   "outputs": [],
   "source": [
    "# OLS regression\n",
    "reg_ols = sm.OLS(r_y, r_D).fit(cov_type='HC1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DSNHIS_dstPP",
    "outputId": "edc9d3e7-409f-4517-b2e4-bcafcb979d4b"
   },
   "outputs": [],
   "source": [
    "reg_ols.params['r_D']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 125
    },
    "id": "mIiFjqtestPP",
    "outputId": "759e65ec-89e6-4c29-80e8-ffcf89c585ff"
   },
   "outputs": [],
   "source": [
    "est = reg_ols.params['r_D']\n",
    "stderr = reg_ols.HC1_se['r_D']\n",
    "lower_ci, upper_ci = reg_ols.conf_int(alpha=0.05).loc['r_D'].values\n",
    "table.loc['Double Lasso'] = [est, stderr, lower_ci, upper_ci]\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "j2fVTLyqstPP"
   },
   "source": [
    "The least square method provides a rather noisy estimate of the speed of convergence. We can not answer the question if poor countries grow faster than rich countries. The least square method does not work when the ratio $p/n$ is large.\n",
    "\n",
    "In sharp contrast, partialling-out via Lasso provides a more precise estimate. The Lasso based point estimate is $-5\\%$ and the $95\\%$ confidence interval for the (annual) rate of convergence $[-7.8\\%,-2.2\\%]$ only includes negative numbers. This empirical evidence does support the convergence hypothesis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_jfpXjJistPQ"
   },
   "source": [
    "## Method 2: Using Cross-Validated Lasso"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g0KAUNMYl9Bv"
   },
   "source": [
    "This section is for illustration purposes only. Given that we are using the full sample, cross validation *should not* be used for choosing tuning parameters here. Cross validation tends to (mildly) overfit, and this overfitting can lead to substantial problems when inference about parameters is the goal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-unoL91DstPQ"
   },
   "outputs": [],
   "source": [
    "from sklearn import linear_model\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Choose penalty based on KFold cross validation\n",
    "# Given small sample size, we use an aggressive number of 20 folds\n",
    "cv = KFold(n_splits=20, shuffle=True, random_state=123)\n",
    "model_y = make_pipeline(StandardScaler(),\n",
    "                        linear_model.LassoCV(n_alphas=10, cv=cv, max_iter=10000))\n",
    "\n",
    "res_y = y - model_y.fit(W, y).predict(W)\n",
    "r_y = pd.DataFrame(res_y, columns=['r_y'])\n",
    "\n",
    "model_d = make_pipeline(StandardScaler(),\n",
    "                        linear_model.LassoCV(n_alphas=10, cv=cv, max_iter=10000))\n",
    "res_D = D - model_d.fit(W, D).predict(W)\n",
    "r_D = pd.DataFrame(res_D, columns=['r_D'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jp_tzQASstPQ"
   },
   "outputs": [],
   "source": [
    "# OLS regression\n",
    "reg_ols = sm.OLS(r_y, r_D).fit(cov_type='HC1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "QRFkAf_PstPR",
    "outputId": "8d41b88e-13f3-444a-cdc7-1046940802e4"
   },
   "outputs": [],
   "source": [
    "est = reg_ols.params['r_D']\n",
    "stderr = reg_ols.HC1_se['r_D']\n",
    "lower_ci, upper_ci = reg_ols.conf_int(alpha=0.05).loc['r_D'].values\n",
    "table.loc['Double Lasso-CV'] = [est, stderr, lower_ci, upper_ci]\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YC-zlwPRstPR"
   },
   "source": [
    "We find that the outcome model chooses too small of a penalty based on cross-validation, leading to overfitting of the outcome and tiny outcome residuals. This leads to artificially small standard errors and a zero treatment effect. Theoretically driven penalty should be preferred for such small sample sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 887
    },
    "id": "tthzbEkUstPR",
    "outputId": "fb2df2f6-c631-4a6f-fb14-69af3cb6e697"
   },
   "outputs": [],
   "source": [
    "plt.plot(model_y[-1].alphas_, 1 - model_y[-1].mse_path_.mean(axis=1) / np.var(y))\n",
    "plt.title('Outcome Lasso-CV Model: Out-of-sample R-squared as function of penalty level')\n",
    "plt.show()\n",
    "plt.title('Treatment Lasso-CV Model: Out-of-sample R-squared as function of penalty level')\n",
    "plt.plot(model_d[-1].alphas_, 1 - model_d[-1].mse_path_.mean(axis=1) / np.var(D))\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
